A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellations include a change of plans, scheduling conflicts, etc. Cancelling is often made easier by the option to do so free of charge, or at a low cost, which is beneficial to hotel guests but is a less desirable and possibly revenue-diminishing factor for hotels to deal with. Such losses are particularly high for last-minute cancellations.
The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts:
The increasing number of cancellations calls for a Machine Learning based solution that can help in predicting which booking is likely to be canceled. INN Hotels Group has a chain of hotels in Portugal, they are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. You as a data scientist have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict which booking is going to be canceled in advance, and help in formulating profitable policies for cancellations and refunds.
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
# Library required to suppress any warning messages
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)
# Library to split data
from sklearn.model_selection import train_test_split
# To build linear model for statistical analysis and prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
# Library to build Decision Tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# Library to tune different decision tree models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)
#Connecting Google drive with Google colab
# Reading the data-set into Google colab
from google.colab import drive
drive.mount('/content/drive')
# Reading the "INNHotelsGroup.csv" dataset into a dataframe (i.e. loading the data)
path="/content/drive/My Drive/INNHotelsGroup.csv"
innhotel = pd.read_csv(path)
# creating a copy of the dataset by copying data to another variable to avoid any changes to original data
data = innhotel.copy()
# returning the first 5 rows using the dataframe head method
data.head()
# returning the last 5 rows using dataframe tail method
data.tail()
#checking shape of the dataframe to find out the number of rows and columns using the dataframe shape command
print("There are", data.shape[0], 'rows and', data.shape[1], "columns.")
# Using the dataframe info() method to print a concise summary of the DataFrame
data.info()
Observation
The dataset contains 19 columns: one of the float datatype (avg_price_per_room), thirteen (13) of the integer datatype (no_of_adults, no_of_children, no_of_weekend_nights, no_of_week_nights, required_car_parking_space, lead_time, arrival_year, arrival_month, arrival_date, repeated_guest, no_of_previous_cancellations, no_of_previous_bookings_not_canceled, and no_of_special_requests), and five (5) of the object datatype (Booking_ID, type_of_meal_plan, room_type_reserved, market_segment_type, and booking_status).
Total memory usage is approximately 5.3+ MB.
# checking the statistical summary of the data using describe command and transposing.
data.describe().T
Observation
There are 36275 observations in all.
Differences between mean and median values indicate skewness in the data.
The average number of days between the date of booking and the arrival date (lead time) is 85 days. 25% of the guests have a lead time below 17 days, 50% have a lead time below 57 days, and the maximum lead time is 443 days. This indicates that most guests book well in advance of their arrival.
The maximum number of adults on a booking is 4, and 75% of bookings include at most 2 adults. This indicates that most bookings are made for one or two guests.
The maximum number of previous cancellations by a customer is 13.
The maximum number of previous bookings not canceled by a customer is 58.
The average price per room per day is 103.4 euros. 25% of the guests paid less than 80.3 euros, 50% paid below 99.45 euros, and 75% paid below 120 euros; the maximum price per reservation is 540 euros.
The maximum number of special requests made by a guest is 5.
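The mean-to-median gap noted above can be checked numerically with `Series.skew()`; a minimal sketch on toy lead-time values (the numbers echo the quartiles reported above and are illustrative only):

```python
import pandas as pd

# toy lead-time values echoing the reported quartiles (illustrative only)
lead_time = pd.Series([17, 57, 57, 443])
# a long right tail pulls the mean above the median and makes skew positive
print(lead_time.mean() > lead_time.median())  # True
print(lead_time.skew() > 0)                   # True
```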
# Checking for missing values
data.isnull().sum()
Observation: No null values are present, so there are no missing values to treat.
data.nunique()
This gives an idea of the number of unique values in each column.
# checking for duplicate values
data.duplicated().sum()
Observation
Drop Booking_ID since it is just a unique identifier with no predictive value
# Removing the Booking_ID column
data = data.drop(['Booking_ID'], axis=1)
data.head()
Leading Questions:
# creating a combined histogram and boxplot using a function
def histobox_plot(df, column, figsize=(15, 10), kde=False, bins=None):
    # set a grey background (use sns.set_theme() if seaborn version 0.11.0 or above)
    sns.set(style="darkgrid")
    # creating a figure composed of two matplotlib.Axes objects (ax_box and ax_hist)
    f, (ax_box, ax_hist) = plt.subplots(
        2, sharex=True, gridspec_kw={"height_ratios": (0.15, 0.85)}, figsize=figsize
    )
    # assigning a graph to each ax
    sns.boxplot(data=df, x=column, ax=ax_box, showmeans=True, color="violet")
    sns.histplot(data=df, x=column, ax=ax_hist, kde=kde, bins=bins or "auto")
    # add mean (green dashed) and median (black solid) lines to the histogram
    ax_hist.axvline(df[column].mean(), color="green", linestyle="--")
    ax_hist.axvline(df[column].median(), color="black", linestyle="-")
    # remove x-axis name for the boxplot
    ax_box.set(xlabel="")
    # label each histogram bar with its count
    for p in ax_hist.patches:
        height = p.get_height()  # height of each bar
        ax_hist.text(
            x=p.get_x() + p.get_width() / 2,  # centered on the bar
            y=height + 0.2,                   # just above the bar
            s="{:.0f}".format(height),        # count without decimals
            ha="center",
        )
histobox_plot(data, "lead_time")
Observation
histobox_plot(data, "avg_price_per_room")
Observation
data[data["avg_price_per_room"] == 0]
data.loc[data["avg_price_per_room"] == 0, "market_segment_type"].value_counts()
Capping the outliers in avg_price_per_room at the upper whisker
# Calculating the 25th quantile
Q1 = data["avg_price_per_room"].quantile(0.25)
# Calculating the 75th quantile
Q3 = data["avg_price_per_room"].quantile(0.75)
# Calculating IQR
IQR = Q3 - Q1
# Calculating value of upper whisker
Upper_Whisker = Q3 + 1.5 * IQR
Upper_Whisker
# capping the extreme outliers (values >= 500) at the value of the upper whisker
data.loc[data["avg_price_per_room"] >= 500, "avg_price_per_room"] = Upper_Whisker
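The capping step above can be generalized into a small helper; a sketch, where the function name `cap_upper_iqr` and the sample prices are our own, not from the notebook:

```python
import pandas as pd

def cap_upper_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values above Q3 + k*IQR at the upper whisker."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    upper_whisker = q3 + k * (q3 - q1)
    return s.clip(upper=upper_whisker)

prices = pd.Series([80.3, 99.45, 120.0, 540.0])  # made-up room prices
capped = cap_upper_iqr(prices)
print(capped.max() < 540)  # True: the 540 outlier is pulled down to the whisker
```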
histobox_plot(data, "no_of_previous_cancellations")
histobox_plot(data, "no_of_previous_bookings_not_canceled")
# function to create labelled bar plots for automation
def barplot(data, column, perc=True):
    plt.figure(figsize=(10, 5))
    bxp = sns.countplot(data=data, x=column)
    bxp.set_xlabel(column, fontsize=14)
    bxp.axes.set_title("Bar Chart Plot of " + column.upper(), fontsize=16)
    plt.xticks(rotation=90)
    total = len(data[column])  # length of the column
    # label each bar with its count
    for p in bxp.patches:
        height = p.get_height()  # height of each bar
        bxp.text(
            x=p.get_x() + p.get_width() / 2,  # centered on the bar
            y=height + 0.2,                   # just above the bar
            s="{:.0f}".format(height),        # count without decimals
            ha="center",
        )
    # optionally label each bar with its percentage of the total
    if perc:
        for p in bxp.patches:
            pct = 100 * p.get_height() / total  # percentage of each class
            bxp.text(
                x=p.get_x() + p.get_width(),  # at the right edge of the bar
                y=p.get_height() + 0.4,       # just above the bar
                s="{:.0f}%".format(pct),
                ha="left",                    # left-aligned at the bar edge
            )
    plt.show()
barplot(data, "no_of_adults", perc=True)
The distribution of the number of adults is not uniform.
Most of the bookings are for two adults: 26108 (72%) of all bookings.
barplot(data, "no_of_children", perc=True)
The distribution of the number of children is not uniform.
Most of the bookings do not involve children.
# replacing 9, and 10 children with 3
data["no_of_children"] = data["no_of_children"].replace([9, 10], 3)
barplot(data, "no_of_week_nights", perc=True)
The distribution of the number of week nights is not uniform.
Most of the guests who spent nights in the hotel during the week (Monday to Friday) stayed for two week nights, i.e. 11444 (32%) of the guests, followed by guests who stayed one week night, i.e. 9488 (26%) of the guests.
barplot(data, "no_of_weekend_nights", perc=True)
The distribution of the number of weekend nights is not uniform.
The majority of the guests don't spend weekend nights (Saturday and Sunday) in the hotel.
barplot(data, "required_car_parking_space", perc=True)
The distribution of guests requiring a car parking space is not uniform.
Most of the guests, 35151 (97%), don't require a parking space, indicating that most guests don't come with their private cars.
barplot(data, "type_of_meal_plan", perc=True)
barplot(data, "room_type_reserved", perc=True)
The distribution of the type of room reserved is not uniform.
Most of the guests, 28130 (78%), reserve Room Type 1.
barplot(data, "arrival_month", perc=True)
The distribution of arrivals per month is not uniform.
October has the highest number of arrivals by guests: 5317.
barplot(data, "market_segment_type", perc=True)
The distribution of market segment type is not uniform.
Most of the bookings are made through the Online market segment, i.e. 23214 (64%).
barplot(data, "no_of_special_requests", perc=True)
The distribution of the number of special requests is not uniform.
The majority, 19777 (55%) of the guests, don't make any special request to the hotel.
Only 8 of the guests have made five special requests.
barplot(data, "booking_status", perc=True)
The distribution of booking status is not uniform.
Most of the bookings made by guests, 24390 (67%), are not cancelled, while 11885 (33%) of the bookings were cancelled.
Labelling Canceled bookings as 1 and Not_Canceled as 0 for further analysis
data["booking_status"] = data["booking_status"].apply(
    lambda x: 1 if x == "Canceled" else 0
)  # label "Canceled" as 1, everything else as 0
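An equivalent, slightly more declarative encoding uses `Series.map`; a toy check (the label spelling `Not_Canceled` is assumed from the data dictionary):

```python
import pandas as pd

s = pd.Series(["Canceled", "Not_Canceled", "Canceled"])
via_lambda = s.apply(lambda x: 1 if x == "Canceled" else 0)
via_map = s.map({"Canceled": 1, "Not_Canceled": 0})
print(via_lambda.equals(via_map))  # True: both encodings agree
```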
# Using heatmap to check correlation between columns
cols_list = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(12, 7))
sns.heatmap(
data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    # placing the legend outside the plot (a second legend call would override an earlier one)
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
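The `normalize="index"` crosstab used by `stacked_barplot` turns counts into within-category proportions; a toy example with made-up segments:

```python
import pandas as pd

toy = pd.DataFrame(
    {"segment": ["Online", "Online", "Offline", "Online"],
     "canceled": [1, 0, 0, 1]}
)
rates = pd.crosstab(toy["segment"], toy["canceled"], normalize="index")
# each row now sums to 1: e.g. 2 of 3 Online bookings are canceled
print(rates.loc["Online", 1])  # ~0.667
```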
Busiest months in the hotel.
# grouping the data on arrival months and extracting the count of bookings using the group by method
monthly_data = data.groupby(["arrival_month"])["booking_status"].count()
# creating a dataframe with months and count of customers in each month
monthly_data = pd.DataFrame(
{"Month": list(monthly_data.index), "Guests": list(monthly_data.values)}
)
# plotting the variations for different months
plt.figure(figsize=(10, 5))
sns.lineplot(data=monthly_data, x="Month", y="Guests")
plt.show()
Which market segment do most of the guests come from?
stacked_barplot(data, "market_segment_type", "booking_status")
Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
plt.figure(figsize=(10, 6))
sns.boxplot(
data=data, x="market_segment_type", y="avg_price_per_room", palette="gist_rainbow"
)
plt.show()
What percentage of bookings are canceled?
stacked_barplot(data,'arrival_month','booking_status')
Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
stacked_barplot(data,'repeated_guest','booking_status')
Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?
stacked_barplot(data,'no_of_special_requests','booking_status')
We saw earlier that there is a positive correlation between booking status and average price per room. Let's analyze it
distribution_plot_wrt_target(data, "avg_price_per_room", "booking_status")
There is a positive correlation between booking status and lead time also. Let's analyze it further
distribution_plot_wrt_target(data, "lead_time", "booking_status")
Generally people travel with their spouse and children for vacations or other activities. Let's create a new dataframe of the customers who traveled with their families and analyze the impact on booking status.
family_data = data[(data["no_of_children"] >= 0) & (data["no_of_adults"] > 1)].copy()  # copy to avoid SettingWithCopyWarning when adding columns
family_data.shape
family_data["no_of_family_members"] = (
family_data["no_of_adults"] + family_data["no_of_children"]
)
stacked_barplot(family_data,"no_of_family_members","booking_status" )
Let's do a similar analysis for the customer who stay for at least a day at the hotel.
stay_data = data[(data["no_of_week_nights"] > 0) & (data["no_of_weekend_nights"] > 0)].copy()  # copy to avoid SettingWithCopyWarning when adding columns
stay_data.shape
stay_data["total_days"] = (
stay_data["no_of_week_nights"] + stay_data["no_of_weekend_nights"]
)
stacked_barplot(stay_data,"total_days","booking_status" )
As hotel room prices are dynamic, let's see how the prices vary across different months.
plt.figure(figsize=(10, 5))
sns.lineplot(data, x="arrival_month", y="avg_price_per_room")
plt.show()
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
# dropping booking_status
numeric_columns.remove("booking_status")
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
plt.subplot(4, 4, i + 1)
plt.boxplot(data[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
There is no need to treat the outliers here: logistic regression models are not much affected by outliers, because the sigmoid function tapers extreme values.
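The tapering claim can be verified directly: the sigmoid maps a moderate input and an extreme outlier to nearly the same probability. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# a moderate value and an extreme outlier land at almost the same probability,
# so the outlier gains little extra leverage over the fit
print(sigmoid(6))   # ~0.9975
print(sigmoid(60))  # ~1.0
```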
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    # classifying observations whose predicted probability exceeds the threshold as class 1
    pred = model.predict(predictors) > threshold
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    # formatting each cell as "count\npercentage of all observations"
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
X = data.drop(["booking_status"], axis=1)
Y = data["booking_status"]
# let's add the intercept to data
X = sm.add_constant(X)
# Creating dummy variables for categorical features
X = pd.get_dummies(X, drop_first=True)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
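As a reminder of what `drop_first=True` does in the dummy-encoding step above: each categorical column with k levels becomes k-1 dummy columns, with the dropped first level serving as the baseline. A toy example with hypothetical meal-plan labels:

```python
import pandas as pd

# hypothetical meal-plan labels, for illustration only
meals = pd.DataFrame({"meal": ["Plan1", "Plan2", "Plan1", "NotSelected"]})
dummies = pd.get_dummies(meals, drop_first=True)
# 3 levels -> 2 dummy columns; the first level (NotSelected) is the baseline
print(list(dummies.columns))  # ['meal_Plan1', 'meal_Plan2']
```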
We have approximately balanced classes of the dependent variable in both the train and test sets.
There are two ways the model can be wrong: predicting a customer will cancel a booking when in reality they do not, and predicting a customer will not cancel when in reality they do.
Both cases are of concern:
If we predict that a booking will not be canceled and the booking gets canceled, the hotel will lose resources and will have to bear additional distribution-channel costs.
If we predict that a booking will get canceled and the booking doesn't get canceled, the hotel might not be able to provide satisfactory services to the customer by assuming that this booking will be canceled. This might damage brand equity.
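F1 is the harmonic mean of precision and recall, so it stays high only when both are high; a quick check with made-up values:

```python
def f1(precision, recall):
    # harmonic mean: dominated by the smaller of the two inputs
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.9))  # ~0.9 when both are high
print(f1(0.9, 0.1))  # ~0.18, dragged toward the weaker metric
```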
The F1 score should therefore be maximized: the greater the F1 score, the better the chances of minimizing both false negatives and false positives.
# fitting logistic regression model
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit()
print(lg.summary())
print("Training performance:")
model_performance_classification_statsmodels(lg, X_train, y_train)
Observations
# we will define a function to check VIF
def checking_vif(predictors):
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns
    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(predictors.values, i)
        for i in range(len(predictors.columns))
    ]
    return vif
checking_vif(X_train)
The above process can also be done manually by picking one variable at a time that has a high p-value, dropping it, and rebuilding the model. But that would be tedious; using a loop is more efficient.
# initial list of columns
cols = X_train.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
    # defining the train set
    x_train_aux = X_train[cols]
    # fitting the model
    model = sm.Logit(y_train, x_train_aux).fit(disp=False)
    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)
    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()
    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break
selected_features = cols
print(selected_features)
X_train1 = X_train[selected_features]
X_test1 = X_test[selected_features]
logit1 = sm.Logit(y_train, X_train1.astype(float))
lg1 = logit1.fit()
print(lg1.summary())
print("Training performance:")
model_performance_classification_statsmodels(lg1,X_train1, y_train)
# converting coefficients to odds
odds = np.exp(lg1.params)
# finding the percentage change
perc_change_odds = (np.exp(lg1.params) - 1) * 100
# removing limit from number of columns to display
pd.set_option("display.max_columns", None)
# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train1.columns).T
Holding all other features constant, an increase in any of the features no_of_adults, no_of_children, no_of_weekend_nights, no_of_week_nights, lead_time, arrival_year, no_of_previous_cancellations, avg_price_per_room, type_of_meal_plan 2, and type_of_meal_plan Not Selected will increase the odds of a customer cancelling a booking.
Likewise, holding all other features constant, an increase in any of the features required_car_parking_space, arrival_month, repeated_guest, no_of_special_requests, room_type_reserved 2, 4, 5, 6, and 7, the Corporate market segment, and the Offline market segment will decrease the odds of a customer cancelling a booking.
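The odds interpretation comes from exponentiating a coefficient, as in the `np.exp(lg1.params)` step above; a sketch with a purely hypothetical lead_time coefficient:

```python
import numpy as np

beta = 0.016  # hypothetical lead_time coefficient, for illustration only
odds_multiplier = np.exp(beta)
perc_change = (odds_multiplier - 1) * 100
# each extra day of lead time multiplies the cancellation odds by ~1.016,
# i.e. raises them by ~1.6%
print(odds_multiplier, perc_change)
```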
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_train1, y_train)
print("Training performance:")
log_reg_model_train_perf = model_performance_classification_statsmodels(lg1,X_train1, y_train)
log_reg_model_train_perf
logit_roc_auc_train = roc_auc_score(y_train, lg1.predict(X_train1))
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where true positive rate (tpr) is high and and false positive rate (fpr) is low
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
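`np.argmax(tpr - fpr)` picks the threshold that maximizes Youden's J statistic (tpr - fpr); a tiny check on made-up ROC points:

```python
import numpy as np

# made-up ROC points: (fpr, tpr) at decreasing thresholds
fpr = np.array([0.0, 0.1, 0.3, 0.6, 1.0])
tpr = np.array([0.0, 0.5, 0.8, 0.9, 1.0])
thresholds = np.array([1.0, 0.8, 0.5, 0.3, 0.0])
j = tpr - fpr  # Youden's J at each threshold
print(thresholds[np.argmax(j)])  # 0.5 maximizes tpr - fpr here
```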
# creating confusion matrix
confusion_matrix_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Observations
The Recall and F1 score of the model have increased, while the other metrics, Accuracy and Precision, have reduced. The model is still giving a good performance.
y_scores = lg1.predict(X_train1)
prec, rec, tre = precision_recall_curve(y_train, y_scores,)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])

plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
At a threshold of about 0.42 we get balanced Recall and Precision.
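The 0.42 value is read off the plot, but the precision-recall crossover can also be located numerically; a sketch on illustrative arrays (not the model's actual curves):

```python
import numpy as np

# illustrative curves only, not the fitted model's
thresholds = np.array([0.2, 0.3, 0.4, 0.5, 0.6])
precisions = np.array([0.60, 0.68, 0.74, 0.80, 0.85])
recalls = np.array([0.90, 0.84, 0.75, 0.66, 0.55])
ix = np.argmin(np.abs(precisions - recalls))
print(thresholds[ix])  # 0.4: where the two curves are closest
```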
# setting the threshold
optimal_threshold_curve = 0.42
# creating confusion matrix
confusion_matrix_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_curve
)
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Using model with default threshold
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_test1, y_test)
log_reg_model_test_perf = model_performance_classification_statsmodels(lg1, X_test1, y_test)
print("Test performance:")
log_reg_model_test_perf
logit_roc_auc_test = roc_auc_score(y_test, lg1.predict(X_test1))
fpr, tpr, thresholds = roc_curve(y_test, lg1.predict(X_test1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
Using model with threshold=0.37
#creating confusion matrix
confusion_matrix_statsmodels(lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc)
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc
Using model with threshold = 0.42
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_test1, y_test, threshold=optimal_threshold_curve)
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, X_test1, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
# test performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve.T,
],
axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression-default Threshold",
    "Logistic Regression-0.37 Threshold",
    "Logistic Regression-0.42 Threshold",
]
print("Test performance comparison : ")
models_test_comp_df
Conclusion
X = data.drop(["booking_status"], axis=1)# creating independent variables
Y = data["booking_status"] # creating dependent variables
# Creating dummy variables for categorical features
X = pd.get_dummies(X, drop_first=True)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size=0.30, random_state=1)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    # formatting each cell as "count\npercentage of all observations"
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
model = DecisionTreeClassifier(criterion = 'gini', random_state=1)
model.fit(X_train, y_train)
Model performance on training set
confusion_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train
Observations
The decision tree is almost fully grown, hence the model is overfit: it is able to classify almost all the data points in the training set with no errors.
confusion_matrix_sklearn(model, X_test, y_test)
decision_tree_perf_test = model_performance_classification_sklearn(model, X_test, y_test)
decision_tree_perf_test
feature_names = list(X_train.columns)
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
According to the decision tree model before pruning, Lead time is the most important variable for predicting if a customer will cancel a booking followed by average price per room.
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
Model performance improvement
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(2, 7, 2),
"max_leaf_nodes": [50, 75, 150, 250],
"min_samples_split": [10, 30, 50, 70],
}
# Type of scoring used to compare parameter combinations
f1_scorer = make_scorer(f1_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=f1_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
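After fitting, `GridSearchCV` exposes the winning hyperparameter combination via `best_params_` and its mean cross-validated score via `best_score_`. A self-contained sketch on synthetic data (not the hotel dataset) with a reduced grid:

```python
# Sketch on synthetic data: inspect the hyperparameters GridSearchCV picked
# and the mean cross-validated F1 score they achieved.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=1)
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1, class_weight="balanced"),
    {"max_depth": np.arange(2, 7, 2), "min_samples_split": [10, 50]},
    scoring=make_scorer(f1_score),
    cv=5,
)
grid.fit(X, y)
print("best parameters:", grid.best_params_)
print("best mean CV F1:", round(grid.best_score_, 3))
```

Printing `grid_obj.best_params_` in the notebook would likewise show which of the depth, leaf-node, and split settings survived the search.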
# evaluate the tuned estimator from the grid search, not the unpruned model
confusion_matrix_sklearn(estimator, X_train, y_train)
decision_tree_tune_perf_train = model_performance_classification_sklearn(
estimator, X_train, y_train
)
decision_tree_tune_perf_train
confusion_matrix_sklearn(estimator, X_test, y_test)
decision_tree_tune_perf_test = model_performance_classification_sklearn(
estimator, X_test, y_test
)
decision_tree_tune_perf_test
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
Observations
If a guest has a lead time greater than 90.5 days, fewer than 0.5 special requests (i.e., none), an arrival month greater than 11.5 (i.e., December), and an average room price above 93.58, the guest is more likely to cancel.
# importance of features in the tree building
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations
In the tuned decision tree, lead time and the online market segment are the most important features, followed by the number of special requests.
Cost Complexity Pruning
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
pd.DataFrame(path)
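The pruning path above pairs each candidate `ccp_alpha` with the total leaf impurity of the tree pruned at that alpha: larger alphas prune more aggressively, so impurity is non-decreasing along the path. A self-contained sketch on synthetic data (not the hotel dataset):

```python
# Sketch on synthetic data: cost_complexity_pruning_path returns candidate
# effective alphas and the total leaf impurity of the corresponding pruned
# tree; impurity grows as alpha (and hence pruning) increases.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=1)
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
print(len(path.ccp_alphas), "candidate alphas")
print("first alpha:", path.ccp_alphas[0], "last alpha:", round(path.ccp_alphas[-1], 4))
```

The `abs()` around `path.ccp_alphas` in the notebook guards against tiny negative alphas that can appear from floating-point error near zero.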
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(
random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
)
clf.fit(X_train, y_train)
clfs.append(clf)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
f1_train = []
for clf in clfs:
pred_train = clf.predict(X_train)
values_train = f1_score(y_train, pred_train)
f1_train.append(values_train)
f1_test = []
for clf in clfs:
pred_test = clf.predict(X_test)
values_test = f1_score(y_test, pred_test)
f1_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("F1 Score")
ax.set_title("F1 Score vs alpha for training and testing sets")
ax.plot(ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
The F1 score peaks at an alpha of around 0.0001 for both the train and test sets.
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
confusion_matrix_sklearn(best_model, X_train, y_train)
decision_tree_post_perf_train = model_performance_classification_sklearn(
best_model, X_train, y_train
)
decision_tree_post_perf_train
confusion_matrix_sklearn(best_model, X_test, y_test)
decision_tree_post_test = model_performance_classification_sklearn(
best_model, X_test, y_test
)
decision_tree_post_test
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
best_model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# training performance comparison
models_train_comp_df = pd.concat(
[
decision_tree_perf_train.T,
decision_tree_tune_perf_train.T,
decision_tree_post_perf_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
models_test_comp_df = pd.concat(
[
decision_tree_perf_test.T,
decision_tree_tune_perf_test.T,
decision_tree_post_test.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree Sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Test performance comparison : ")
models_test_comp_df
Conclusion
The decision tree model with pre-pruning gave the best recall score on the training data. The decision tree model with post-pruning gave the best recall score on the test set. The pre-pruned tree is not complex and is easy to interpret.
Model comparison
Logistic Regression
Holding all other features constant, an increase in any of the following features increases the odds of a customer cancelling a booking: number of adults, number of children, number of weekend nights, number of week nights, lead time, arrival year, number of previous cancellations, average price per room, meal plan 2, and meal plan not selected.
Likewise, holding all other features constant, an increase in any of the following features decreases the odds of a customer cancelling a booking: required car parking space, arrival month, repeated guest, number of special requests, room types 2, 4, 5, 6, and 7 reserved, the Corporate market segment, and the Offline market segment.
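The "increase/decrease the odds" statements above follow from the sign of each logistic-regression coefficient: exponentiating a coefficient gives an odds ratio, with values above 1 raising the odds of cancellation and values below 1 lowering them. A self-contained sketch on synthetic data (not the notebook's fitted model; feature names are placeholders):

```python
# Sketch on synthetic data: convert logistic-regression coefficients to
# odds ratios with exp(); ratio > 1 raises the odds of the positive class
# (cancellation), ratio < 1 lowers them.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0, random_state=1
)
logit = LogisticRegression(max_iter=1000).fit(X, y)
odds_ratios = np.exp(logit.coef_[0])
for i, orat in enumerate(odds_ratios):
    direction = "raises" if orat > 1 else "lowers"
    print(f"feature_{i}: odds ratio {orat:.2f} ({direction} the odds)")
```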
Decision Tree
The hotel should form a cancellation policy for guests who meet the criteria identified above, e.g. those who booked a room with an average price above 93.58, because such guests are more likely to cancel.
The hotel can state this in its terms and conditions for intending guests: whenever a booking meets the above criteria, a cancellation will be refunded only after deducting a charge; otherwise, the guest must pay more to retain a fully refundable booking.
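One way to operationalise such a policy is to score each incoming booking's cancellation probability with the fitted tree and flag bookings above a chosen cutoff for a deposit or stricter refund terms. A minimal sketch on synthetic stand-in data (the 0.7 threshold is illustrative, not derived from the notebook):

```python
# Sketch on synthetic data: use predict_proba to score bookings and flag
# high-risk ones for a deposit; the 0.7 cutoff is an illustrative choice.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=1)
clf = DecisionTreeClassifier(max_depth=5, random_state=1).fit(Xtr, ytr)

cancel_prob = clf.predict_proba(Xte)[:, 1]  # predicted P(cancel) per booking
flagged = cancel_prob >= 0.7                # bookings subject to the policy
print(f"flagged {int(flagged.sum())} of {len(flagged)} bookings for a deposit")
```

In practice the cutoff would be chosen to balance lost bookings against recovered cancellation revenue.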